CMPINF 2100: Homework 07¶

Fengyang Han¶

Assigned: Tuesday of Week 07 at 11:00PM¶

DUE: Tuesday of Week 08 at 11:59PM¶

Overview¶

This assignment is focused on exploring categorical-to-continuous variable relationships and continuous-to-continuous variable relationships. It is not open ended like the last two assignments. There are certain tasks you must complete for all problems, but you will gain experience with the different plot types introduced in the Week 07 recordings. You will practice creating, modifying, interpreting, and communicating insights from them. The last question requires you to visually explore relationships associated with one of the final projects of your choosing.

You must download the 3 data sets provided in the Canvas assignment page and save them to the appropriate directory on your computer.

Collaborators¶

Shiyi Wang

Required tasks for Problem 01, 02, and 04¶

For each of the 3 assigned data sets you must perform the following ESSENTIAL activities:

  • Display the number of rows and columns in the dataset
  • Display the names of the columns and their associated data types
  • Display the number of unique values for each column
  • Display the number of MISSING values for each column

You do NOT need to display basic descriptive statistics and counts. You will visually explore the variables in each problem.
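
As a rough illustration of the four ESSENTIAL activities, the sketch below collects them for a small made-up DataFrame. The `essential_eda` helper and the `toy` data are hypothetical names invented here, not part of the assignment files. The same pandas calls should apply unchanged to df01, df02, and df04.

```python
import pandas as pd


def essential_eda(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, unique count, and missing count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_unique': df.nunique(),      # NaN/None is not counted as a unique value
        'n_missing': df.isna().sum(),
    })


# Toy data invented for illustration only (NOT one of the assigned CSV files).
toy = pd.DataFrame({
    'x': ['A', 'B', 'A', None],
    'value': [1.5, 2.0, 2.0, 3.25],
})

print(toy.shape)            # (number of rows, number of columns)
summary = essential_eda(toy)
print(summary)
```

Keeping the three per-column summaries in one table makes it easy to scan a wide data set such as df04 in a single output cell.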

Problem 00¶

You will work with the NumPy, Pandas, matplotlib.pyplot, and Seaborn modules in this assignment.

Import NumPy, Pandas, matplotlib.pyplot, and Seaborn using their commonly accepted aliases.

00) - SOLUTION¶

In [1]:
###
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Problem 01¶

1a)¶

Read in the hw07_prob_01.csv CSV file and assign it to the df01 object.

1a) - SOLUTION¶

In [2]:
df01 = pd.read_csv('hw07_prob_01.csv')

1b)¶

Perform the ESSENTIAL Exploratory Data Analysis (EDA) tasks.

Add as many cells as you feel are necessary.

1b) - SOLUTION¶

In [25]:
### 1.1 Basic info on number of rows and columns, names of columns and data types
print(df01.shape)
df01.info()
(2800, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2800 entries, 0 to 2799
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       2800 non-null   object 
 1   value   2800 non-null   float64
dtypes: float64(1), object(1)
memory usage: 43.9+ KB
In [5]:
### 1.2 Check for missing values
df01.isnull().sum()
Out[5]:
x        0
value    0
dtype: int64
In [6]:
### 1.3 Check for unique values
df01.nunique()
Out[6]:
x           4
value    2800
dtype: int64
In [8]:
### 1.4 Describe all columns
df01.describe(include='all') 
Out[8]:
           x        value
count   2800  2800.000000
unique     4          NaN
top        A          NaN
freq     700          NaN
mean     NaN     3.602424
std      NaN     3.092654
min      NaN    -5.421891
25%      NaN     1.301328
50%      NaN     3.040259
75%      NaN     5.619022
max      NaN    20.348324

1c)¶

Create a BAR CHART using Seaborn to show the COUNTS for the non-numeric column in df01.

Are the unique values BALANCED?

1c) - SOLUTION¶

In [11]:
### 2.1 Bar chart for non-numeric columns

sns.catplot(data = df01, x = 'x', kind='count')
Out[11]:
<seaborn.axisgrid.FacetGrid at 0x24ab8404670>

Yes, they are balanced: each of the 4 categories has 700 rows.

1d)¶

Create a HISTOGRAM using Seaborn to visualize the marginal distribution of the continuous variable in df01.

Does the marginal distribution appear symmetric?

1d) - SOLUTION¶

In [12]:
### 2.2 Histogram for numeric column marginal distribution

sns.displot(data=df01, x='value', kind='hist')
Out[12]:
<seaborn.axisgrid.FacetGrid at 0x24abed7cc40>

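It is not symmetric. The distribution is right-skewed: the mean (3.60) is above the median (3.04) and the upper tail extends to about 20.3.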

1e)¶

You will now explore the categorical-to-continuous relationship between the non-numeric column and numeric column in df01.

Create a BOX PLOT using Seaborn to visualize the summary statistics of the numeric column GIVEN the non-numeric column.

Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the BOX PLOT?

1e) - SOLUTION¶

In [14]:
### 2.3 Boxplot to visualize the summary statistics of the numeric column GIVEN the non-numeric column

sns.catplot(data=df01, x='x', y='value', kind='box',
            showmeans=True,
            meanprops={'marker':'o','markerfacecolor':'white','markeredgecolor':'black'})
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x24abda81dc0>

Yes, they are different.

1f)¶

Create a POINT PLOT using Seaborn to compare the conditional means of the numeric column GIVEN the non-numeric column.

Are the averages of the numeric column DIFFERENT across the CATEGORIES of the non-numeric column?

1f) - SOLUTION¶

In [16]:
### 2.4 Point plot to compare the conditional means of the numeric column GIVEN the non-numeric column

# `join=False` is deprecated in recent Seaborn releases; `linestyle='none'` removes the connecting line instead
sns.catplot(data=df01, x='x', y='value', kind='point', linestyle='none')
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x24abed131c0>

Yes, they are different.
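
The markers in a point plot are the per-category averages, so the visual comparison can be cross-checked numerically. The sketch below does this with a pandas groupby on made-up data; `toy` and its four group means are assumptions for illustration, not values from hw07_prob_01.csv. The standard error computed here is only a rough stand-in for the uncertainty bars, which Seaborn derives differently by default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Toy data: four categories whose true means are 0, 2, 4, and 6 (assumed values).
toy = pd.DataFrame({
    'x': np.repeat(['A', 'B', 'C', 'D'], 50),
    'value': np.concatenate([rng.normal(loc=m, size=50) for m in (0, 2, 4, 6)]),
})

# Conditional summary statistics of `value` GIVEN `x`.
summary = toy.groupby('x')['value'].agg(['mean', 'median', 'std', 'count'])

# Standard error of the mean: a simple measure of how precisely each average is known.
summary['sem'] = summary['std'] / np.sqrt(summary['count'])

print(summary)
```

If the gaps between the group means are large compared to the standard errors, the averages can reasonably be called different.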

1g)¶

Create a VIOLIN PLOT using Seaborn to visualize the conditional density of the numeric column GIVEN the non-numeric column.

Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the VIOLIN PLOT?

1g) - SOLUTION¶

In [37]:
### 2.5 Violin plot to visualize conditional density of the numeric column GIVEN the non-numeric column

sns.catplot(data=df01, x='x', y='value', kind='violin')
Out[37]:
<seaborn.axisgrid.FacetGrid at 0x24ac7693460>

Yes, they are different.

1h)¶

Create a CONDITIONAL KDE plot using Seaborn to show the conditional density of the numeric column GIVEN the non-numeric column. The non-numeric column must be associated with the KDE color.

Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the CONDITIONAL KDE plot?

HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?

HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?

1h) - SOLUTION¶

In [29]:
### 2.6 Conditional KDE plot to show conditional density of the numeric column GIVEN the non-numeric column, column associated with KDE color

sns.displot(data=df01, x='value', hue='x', kind='kde',common_norm=False)
Out[29]:
<seaborn.axisgrid.FacetGrid at 0x24ac0181d60>

Yes, they are different.
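
A note on the SAMPLE SIZE hint: with `common_norm=True` (the Seaborn default) each group's density is scaled by its share of the rows, so a small group looks short even when its shape matches a large one. The sketch below mimics that effect with NumPy histograms on two made-up groups of unequal size. It is an analogy under stated assumptions, not the KDE computation Seaborn performs, and all variable names are invented here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up groups with the same spread but very different sample sizes.
big = rng.normal(loc=0.0, scale=1.0, size=900)
small = rng.normal(loc=3.0, scale=1.0, size=100)

edges = np.linspace(-5, 8, 40)
width = np.diff(edges)

# Per-group normalization (analogous to common_norm=False): each area is 1.
d_big = np.histogram(big, bins=edges, density=True)[0]
d_small = np.histogram(small, bins=edges, density=True)[0]

# Common normalization (analogous to common_norm=True): areas sum to 1 overall,
# so each group's area equals its fraction of the total rows.
n_total = big.size + small.size
c_big = d_big * big.size / n_total
c_small = d_small * small.size / n_total

print((d_small * width).sum(), (c_small * width).sum())
```

In df01 the categories are balanced, so the two settings give similar pictures, but setting `common_norm=False` keeps the comparison about SHAPE rather than group size.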

1i)¶

Create a FACETED HISTOGRAM plot using Seaborn to show the conditional histogram of the numeric column GIVEN the non-numeric column. The non-numeric column must be associated with the COLUMN FACETS. The x and y scales of the facets must be free or not-shared across the facets.

Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the FACETED HISTOGRAM?

HINT: Which Seaborn argument allows you to ASSOCIATE or link the COLUMN FACET to a column in the data?

1i) - SOLUTION¶

In [36]:
### 2.7 Facet histogram to show conditional histogram of numeric column GIVEN the non-numeric column, the non-numeric column as the facet, the x and y are not shared

sns.displot(data=df01, x='value', col='x', kind='hist', 
            aspect=0.75,
            facet_kws={'sharex':False,'sharey':False})
Out[36]:
<seaborn.axisgrid.FacetGrid at 0x24ac5e1f3a0>

Yes, they are different.

1j)¶

You have explored the CONDITIONAL DISTRIBUTIONS of the numeric column GIVEN the non-numeric column.

Which plot types made it easy to COMPARE summary statistics across the categories?

Which plot types made it easy to COMPARE the distributional SHAPE across the categories?

1j) - SOLUTION¶

What do you think?

The box plot made it easy to compare summary statistics across the categories.

The faceted histogram made it easy to compare the distributional shape across the categories.

Problem 02¶

2a)¶

Read in the hw07_prob_02.csv CSV file and assign it to the df02 object.

2a) - SOLUTION¶

In [38]:
###
df02 = pd.read_csv('hw07_prob_02.csv')

2b)¶

Perform the ESSENTIAL Exploratory Data Analysis (EDA) tasks.

Add as many cells as you feel are necessary.

2b) - SOLUTION¶

In [39]:
### 1.1 Basic info on number of rows and columns, names of columns and data types
print(df02.shape)
df02.info()
(900, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      900 non-null    float64
 1   x2      900 non-null    float64
 2   m1      900 non-null    object 
dtypes: float64(2), object(1)
memory usage: 21.2+ KB
In [40]:
### 1.2 Check for missing values
df02.isnull().sum()
Out[40]:
x1    0
x2    0
m1    0
dtype: int64
In [41]:
### 1.3 Check for unique values
df02.nunique()
Out[41]:
x1    900
x2    900
m1      9
dtype: int64
In [43]:
### 1.4 Describe all columns
df02.describe(include='all')
Out[43]:
                x1          x2   m1
count   900.000000  900.000000  900
unique         NaN         NaN    9
top            NaN         NaN    A
freq           NaN         NaN  100
mean      0.019472    0.039430  NaN
std       1.038965    1.037111  NaN
min      -3.435065   -2.940527  NaN
25%      -0.705076   -0.707391  NaN
50%      -0.020579    0.043096  NaN
75%       0.728071    0.735318  NaN
max       3.068722    2.886683  NaN

2c)¶

Create a BAR CHART using Seaborn to show the COUNTS for the non-numeric column in df02.

Are the unique values BALANCED?

2c) - SOLUTION¶

In [44]:
### 2.1 Bar chart for non-numeric columns
sns.catplot(data=df02, x='m1', kind='count')
Out[44]:
<seaborn.axisgrid.FacetGrid at 0x24ac7b71580>

Yes, they are balanced: each of the 9 categories has 100 rows.

2d)¶

Create HISTOGRAMS using Seaborn to visualize the marginal distributions of the continuous variables in df02.

You may create separate figures for each histogram based on the WIDE FORMAT data OR reshape the data into LONG FORMAT and create separate FACETS for each variable. You CANNOT use for-loops to create the separate histograms.

Do the marginal distributions appear symmetric?

2d) - SOLUTION¶

In [52]:
### 2.2 Histogram for numeric column marginal distribution
sns.displot(data=df02, x='x1', kind='hist', bins=20)
Out[52]:
<seaborn.axisgrid.FacetGrid at 0x24acb414d90>
In [69]:
sns.displot(data=df02, x='x2', kind='hist', bins=20)
Out[69]:
<seaborn.axisgrid.FacetGrid at 0x24ace651520>
In [55]:
### 2.2.0 Melt the data frame for numeric columns
df02_lf = (
    df02.reset_index()
    .rename(columns={'index': 'id'})
    .melt(id_vars=['id', 'm1'], value_vars=['x1', 'x2'])
)
In [54]:
### 2.2.1 Histogram for numeric column marginal distribution
sns.displot(data=df02_lf, x='value', kind='hist',col='variable',bins= 20)
Out[54]:
<seaborn.axisgrid.FacetGrid at 0x24acb0d3d90>

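I notice a slight difference in the x2 histogram between values 2 and 3 when comparing the separate figure with the faceted figure. This is due to the different binning schemes.

Both marginal distributions are roughly symmetric.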
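
A possible explanation for the binning difference: a stand-alone histogram derives its 20 bin edges from the range of that one variable, while the faceted long-format figure presumably derives them from the pooled `value` column. The NumPy sketch below shows how the edges shift between the two cases. The arrays are made-up stand-ins for x1 and x2, and the claim about how the faceted figure picks its edges is an assumption, not something verified against the Seaborn source here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up stand-ins for the two numeric columns (NOT the hw07_prob_02.csv values).
x1 = rng.normal(size=900)
x2 = rng.normal(size=900)

# Edges when x2 is binned alone (wide-format, one figure per variable).
edges_alone = np.histogram_bin_edges(x2, bins=20)

# Edges when both variables are stacked first (long-format, shared bins).
edges_pooled = np.histogram_bin_edges(np.concatenate([x1, x2]), bins=20)

print(edges_alone[:3])
print(edges_pooled[:3])
```

Because the pooled range is at least as wide as the range of x2 alone, the same observations can land in different bins, which changes bar heights slightly without changing the data.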

2e)¶

Create CONDITIONAL KDE plots using Seaborn to show the conditional densities of each numeric column GIVEN the non-numeric column. The non-numeric column must be associated with the KDE color.

You may create separate figures for each KDE plot based on the WIDE FORMAT data OR reshape the data into LONG FORMAT and create separate FACETS for each variable. You CANNOT use for-loops to create the separate KDE plots.

Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the CONDITIONAL KDE plot?

HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?

HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?

2e) - SOLUTION¶

In [71]:
### 2.6 Conditional KDE plot to show conditional density of the numeric column GIVEN the non-numeric column, column associated with KDE color
sns.displot(data=df02_lf, x='value', hue='m1', kind='kde',col = 'variable' ,common_norm=False)
Out[71]:
<seaborn.axisgrid.FacetGrid at 0x24ace6fac70>

No, the conditional distributions appear similar across the categories.

2f)¶

Create BOX PLOTS using Seaborn to visualize the summary statistics of the numeric columns GIVEN the non-numeric column.

You may create separate figures for each boxplot based on the WIDE FORMAT data OR reshape the data into LONG FORMAT and create separate FACETS for each variable. You CANNOT use for-loops to create the separate boxplots.

Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the BOX PLOT?

2f) - SOLUTION¶

In [72]:
### 2.3 Boxplot to visualize the summary statistics of the numeric column GIVEN the non-numeric column

sns.catplot(data=df02_lf, x='m1', y='value', kind='box',col='variable')
Out[72]:
<seaborn.axisgrid.FacetGrid at 0x24ace6499d0>

No, they do not appear to be different.

2g)¶

Although there are several other CONDITIONAL DISTRIBUTION related figures to make, let's shift focus to the RELATIONSHIP between two continuous variables.

Create a scatter plot between the continuous variables using Seaborn.

Can you see any clear relationships between the two?

2g) - SOLUTION¶

In [78]:
### 2.4 Scatter plot to visualize the relationship between the two numeric columns, ignoring the non-numeric column

sns.relplot(data=df02, x='x1', y='x2', kind='scatter')
Out[78]:
<seaborn.axisgrid.FacetGrid at 0x24ad13fa760>

No, I cannot see any clear relationships between the two.

2h)¶

Let's now check if the continuous variable relationship depends on the non-numeric variable.

Create a scatter plot between the continuous variables using Seaborn. Color the markers based on the non-numeric column to study if the relationship CHANGES across the categories.

Does the CONDITIONAL RELATIONSHIP appear DIFFERENT across the CATEGORIES?

HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?

2h) - SOLUTION¶

In [79]:
### 2.4.1 Scatter plot to visualize the relationship between the numeric columns GIVEN the non-numeric column, with markers colored by the non-numeric column

sns.relplot(data=df02, x='x1', y='x2', kind='scatter', hue='m1')
Out[79]:
<seaborn.axisgrid.FacetGrid at 0x24ad149da60>

Yes, they appear different.

2i)¶

Let's include a TREND line within the scatter plot to help visualize the linear relationship between the two continuous variables. Let's begin by IGNORING the potential influence of the non-numeric column.

Create a scatter plot which includes a trend line to show the linear relationship between the two numeric columns. You should NOT color based on the non-numeric column.

What kind of relationship does the TREND line represent when the non-numeric column is ignored?

2i) - SOLUTION¶

In [80]:
### 2.5 Trend plot
sns.lmplot(data=df02, x='x1', y='x2')
Out[80]:
<seaborn.axisgrid.FacetGrid at 0x24ace4e27c0>

The trend line is nearly flat, which shows that there is little to no linear relationship between the two numeric columns when the non-numeric column is ignored.

2j)¶

Let's now include TREND lines that are associated with the categories of the non-numeric column.

Create a scatter plot which includes trend lines to show the linear relationship between the numeric columns. Color the markers and the trend lines based on the non-numeric column.

Does the CONDITIONAL RELATIONSHIP appear DIFFERENT across the CATEGORIES?

HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?

2j) - SOLUTION¶

In [81]:
###
sns.lmplot(data=df02, x='x1', y='x2', hue='m1')
Out[81]:
<seaborn.axisgrid.FacetGrid at 0x24ad14edb80>

Yes, the relationship between the two numeric columns differs across the categories: the trend lines range from clearly decreasing to clearly increasing.

2k)¶

Lastly, let's FACET by the non-numeric column!

Create a scatter plot which includes trend lines to show the linear relationship between the numeric columns. Color the markers and trend lines and FACET based on the non-numeric column. The color and facets are therefore associated with the SAME variable.

The facets should have 3 columns per row.

Does the CONDITIONAL RELATIONSHIP appear DIFFERENT across the CATEGORIES?

HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?

HINT: Which Seaborn argument allows you to ASSOCIATE or link the COLUMN FACET to a column in the data?

2k) - SOLUTION¶

In [84]:
### 2.7 FACET non-numeric column
sns.lmplot(data=df02, x='x1', y='x2', hue='m1', col='m1', col_wrap=3)
Out[84]:
<seaborn.axisgrid.FacetGrid at 0x24ad62c8700>

Yes, the facets show that the relationship between the two numeric columns differs across the categories, with both decreasing and increasing trends.

Problem 03¶

You will continue working with the data from Problem 02 to explore the relationship between the two continuous variables.

3a)¶

Linear relationships can be summarized by calculating the correlation coefficient between the numeric columns. The correlation coefficients can be visualized as correlation plots via heat maps. However, let's first practice calculating the correlation matrix between the two numeric columns in df02.

Display the correlation matrix for the numeric columns in df02 to the screen. You do NOT need to assign the correlation matrix to an object.

3a) - SOLUTION¶

In [86]:
### 3.1 Correlation matrix for numeric columns in df02

df02.corr(numeric_only=True)
Out[86]:
          x1        x2
x1  1.000000  0.021982
x2  0.021982  1.000000
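
For reference, each off-diagonal entry of a correlation matrix is a Pearson correlation coefficient. The sketch below computes it by hand with NumPy and compares the result to `np.corrcoef`. The `pearson_r` helper and the synthetic arrays are invented for illustration and are not part of the assignment data.

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()          # center each variable at zero
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())


rng = np.random.default_rng(2100)

# Synthetic pair with a moderate positive linear relationship (assumed, not from df02).
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + rng.normal(size=200)

r_manual = pearson_r(x1, x2)
r_numpy = np.corrcoef(x1, x2)[0, 1]

print(r_manual, r_numpy)
```

The coefficient is bounded between -1 and 1, which is why the heat maps in this problem use `vmin=-1`, `vmax=1`, and `center=0`.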

3b)¶

Let's now VISUALIZE the correlation plot as a heat map!

Create a correlation plot between the numeric columns in df02. The correlation plot must be created using Seaborn. You must use a DIVERGING color palette with the correct bounds and midpoint. The correlation plot must be annotated.

You must ignore the non-numeric column for this correlation plot.

3b) - SOLUTION¶

In [87]:
### 3.2 Heatmap for correlation matrix
sns.heatmap(df02.corr(numeric_only=True), 
            vmin=-1, vmax=1, center=0, cmap='coolwarm',
            annot=True,annot_kws={'size':20},
            cbar=False,
            fmt='.2f')
Out[87]:
<AxesSubplot: >

3c)¶

Let's now examine if the correlation plot CHANGES across the categories of the non-numeric column. However, let's practice calculating the grouped correlation matrix BEFORE visualizing the correlation plot.

Display the grouped correlation matrix for the numeric columns in df02 to the screen. You must group by the non-numeric column. You do NOT need to assign the correlation matrix to an object.

3c) - SOLUTION¶

In [88]:
### 3.3 Grouped correlation matrix
df02.groupby('m1').corr(numeric_only=True)
Out[88]:
             x1        x2
m1                       
A  x1  1.000000 -0.991282
   x2 -0.991282  1.000000
B  x1  1.000000 -0.880486
   x2 -0.880486  1.000000
C  x1  1.000000 -0.722998
   x2 -0.722998  1.000000
D  x1  1.000000 -0.395593
   x2 -0.395593  1.000000
E  x1  1.000000 -0.059890
   x2 -0.059890  1.000000
F  x1  1.000000  0.270515
   x2  0.270515  1.000000
G  x1  1.000000  0.785730
   x2  0.785730  1.000000
H  x1  1.000000  0.902762
   x2  0.902762  1.000000
I  x1  1.000000  0.992068
   x2  0.992068  1.000000
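
The grouped matrix helps explain why the ungrouped correlation is close to zero: strong negative and strong positive within-group relationships can cancel when the groups are pooled. The sketch below reproduces that effect on made-up data with two groups of opposite slope. The `toy` DataFrame and its group labels are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
base = rng.normal(size=150)

# Two made-up groups that share the same x1 values but have opposite slopes.
toy = pd.DataFrame({
    'g': np.repeat(['neg', 'pos'], base.size),
    'x1': np.concatenate([base, base]),
})
slope = np.where(toy['g'] == 'pos', 1.0, -1.0)
toy['x2'] = slope * toy['x1'] + rng.normal(scale=0.1, size=len(toy))

# Correlation when the grouping variable is ignored.
pooled_r = toy['x1'].corr(toy['x2'])

# Correlation matrix per group; rows are indexed by (group, column name).
grouped = toy.groupby('g')[['x1', 'x2']].corr()
r_neg = grouped.loc[('neg', 'x1'), 'x2']
r_pos = grouped.loc[('pos', 'x1'), 'x2']

print(pooled_r, r_neg, r_pos)
```

This is the reason to check CONDITIONAL relationships before concluding that two variables are unrelated.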

3d)¶

Let's now VISUALIZE the grouped correlation plot!

Create a grouped correlation plot between the numeric columns in df02. You must group by the non-numeric column. The separate categories on the non-numeric column must be associated with separate subplots. The subplot title must be specified correctly to make it clear which subplot is associated with which value of the non-numeric column. You must use a DIVERGING color palette with the correct bounds and midpoint. The correlation plot must be annotated.

3d) - SOLUTION¶

In [109]:
### 3.4 Grouped heatmap for correlation matrix
the_groups = df02.m1.unique().tolist()

corr_per_group = df02.groupby('m1').corr(numeric_only=True)

fig,axs = plt.subplots(len(the_groups),1,figsize=(5,50),sharex=True,sharey=True)

for ix in range(len(the_groups)):
    sns.heatmap(corr_per_group.loc[the_groups[ix],:], 
                vmin=-1, vmax=1, center=0, cmap='coolwarm',
                annot=True,annot_kws={'size':10},
                cbar=False,
                fmt='.2f',
                ax=axs[ix])
    axs[ix].set_title('m1: %s' % the_groups[ix])

3e)¶

You have visualized the distributions and relationship between the continuous variables in df02 several ways. Let's conclude by working with a plot type that combines both aspects into a single graphic.

Create a PAIRS PLOT to show the marginal histograms and scatter plot between the numeric columns in df02. You must ignore the non-numeric column.

3e) - SOLUTION¶

In [115]:
### 3.5 Pairplot for numeric columns in df02
sns.pairplot(data=df02, vars=['x1','x2'],
             diag_kws={'common_norm':False})
Out[115]:
<seaborn.axisgrid.PairGrid at 0x24ad95a4520>

3f)¶

CONDITIONAL DISTRIBUTIONS and CONDITIONAL RELATIONSHIPS can be shown within a PAIRS PLOT. The non-numeric column can be associated with COLOR which creates separate colored CONDITIONAL DISTRIBUTIONS and separate colored MARKERS within the SCATTER PLOTS. You must COLOR the PAIRS PLOT by the non-numeric column.

HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?

3f) - SOLUTION¶

In [116]:
### 3.6 Grouped pairplot for numeric columns in df02
sns.pairplot(data=df02, vars=['x1','x2'], hue='m1',
                diag_kws={'common_norm':False})
Out[116]:
<seaborn.axisgrid.PairGrid at 0x24ad95a41f0>

3g)¶

You have visually explored the relationship between the numeric columns in many different ways. You ignored the non-numeric column, and you also examined if the relationship CHANGED across the categories of the non-numeric column.

Which plot type did you feel was the easiest for identifying if the relationship changed across the categories of the non-numeric column?

3g) - SOLUTION¶

What do you think?

The pairs plot is the easiest for identifying whether the relationship changes across the categories of the non-numeric column. However, the heat map shows the correlation between the two numeric columns more clearly.

Problem 04¶

4a)¶

Read in the hw07_prob_04.csv CSV file and assign it to the df04 object.

4a) - SOLUTION¶

In [117]:
###
df04 = pd.read_csv('hw07_prob_04.csv')

4b)¶

Perform the ESSENTIAL Exploratory Data Analysis (EDA) tasks.

Add as many cells as you feel are necessary.

4b) - SOLUTION¶

In [118]:
### 1.1 Basic info on number of rows and columns, names of columns and data types
print(df04.shape)
df04.info()
(633, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 633 entries, 0 to 632
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x01     633 non-null    float64
 1   x02     633 non-null    float64
 2   x03     633 non-null    float64
 3   x04     633 non-null    float64
 4   x05     633 non-null    float64
 5   x06     633 non-null    float64
 6   x07     633 non-null    float64
 7   x08     633 non-null    float64
 8   x09     633 non-null    float64
 9   x10     633 non-null    float64
 10  x11     633 non-null    float64
 11  x12     633 non-null    float64
 12  v       633 non-null    object 
dtypes: float64(12), object(1)
memory usage: 64.4+ KB
In [119]:
### 1.2 Check for missing values
df04.isnull().sum()
Out[119]:
x01    0
x02    0
x03    0
x04    0
x05    0
x06    0
x07    0
x08    0
x09    0
x10    0
x11    0
x12    0
v      0
dtype: int64
In [120]:
### 1.3 Check for unique values
df04.nunique()
Out[120]:
x01    633
x02    633
x03    633
x04    633
x05    633
x06    633
x07    633
x08    633
x09    633
x10    633
x11    633
x12    633
v        3
dtype: int64
In [121]:
### 1.4 Describe all columns
df04.describe(include='all')
Out[121]:
               x01         x02         x03         x04         x05         x06         x07         x08         x09         x10         x11         x12    v
count   633.000000  633.000000  633.000000  633.000000  633.000000  633.000000  633.000000  633.000000  633.000000  633.000000  633.000000  633.000000  633
unique         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN    3
top            NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN   A1
freq           NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN  211
mean      0.017201   -0.117542    0.028487   -0.098241    0.053754   -0.062829    0.078069   -0.030272    0.028753   -0.062405    0.025930   -0.044316  NaN
std       0.985951    1.007731    3.752440    0.991107    0.984573    0.949131    1.022223    1.035076    3.897745    1.003886    0.998188    0.971035  NaN
min      -2.962974   -3.314671   -7.369698   -2.603420   -3.378890   -3.024685   -2.985966   -2.777091   -7.422036   -2.802904   -3.056180   -2.715199  NaN
25%      -0.645987   -0.832902   -3.563217   -0.783992   -0.623991   -0.698531   -0.600231   -0.714685   -3.907429   -0.715740   -0.663492   -0.718663  NaN
50%       0.010903   -0.087405    0.040706   -0.111189   -0.000370   -0.068309    0.052151   -0.024770    0.046817   -0.065431    0.036369   -0.024997  NaN
75%       0.661162    0.533091    3.743382    0.597446    0.714583    0.573939    0.807849    0.703130    3.955639    0.632654    0.674309    0.640892  NaN
max       2.805809    2.773446    7.370209    2.883285    2.948060    2.660166    2.950523    2.759512    6.958443    2.998344    2.975531    2.948998  NaN

4c)¶

Create a BAR CHART using Seaborn to show the COUNTS for the non-numeric column in df04.

Are the unique values BALANCED?

4c) - SOLUTION¶

In [123]:
### 2.1 Bar chart for non-numeric columns
sns.catplot(data=df04, x='v', kind='count')
Out[123]:
<seaborn.axisgrid.FacetGrid at 0x24ad95a4b50>

Yes, they are balanced: each of the 3 categories has 211 rows.

4d)¶

It is best to study the marginal distributions and then conditional distributions associated with continuous variables (numeric columns) BEFORE exploring the relationships between them. However, we will modify the typical EDA workflow for this problem. Let's jump to using the PAIRS PLOT which allows exploring distributions and relationships within a single graphic. We will revisit the distributions in more detail later.

Create a PAIRS PLOT associated with all numeric columns in df04 using Seaborn.

What does this specific PAIRS PLOT reveal about the variables and their relationships?

4d) - SOLUTION¶

In [124]:
### 2.2 Pairplot for numeric columns in df04
sns.pairplot(data=df04, diag_kws={'common_norm':False})
Out[124]:
<seaborn.axisgrid.PairGrid at 0x24ac1a4df40>

This plot reveals some complicated relationships between the variables: several of the scatter plots show multiple distinct lines.

Most of the continuous variables are symmetric and have a shape similar to a normal distribution.

4e)¶

Let's now examine if the non-numeric column impacts the continuous variables. Create a PAIRS PLOT for the numeric columns and COLOR based on the non-numeric column using Seaborn.

What does this specific grouped PAIRS PLOT reveal about the impact of the non-numeric column on the continuous variables?

HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?

4e) - SOLUTION¶

In [125]:
### 2.3 non-numeric column as the difference and colors.
sns.pairplot(data=df04, hue='v', diag_kws={'common_norm':False})
Out[125]:
<seaborn.axisgrid.PairGrid at 0x24ae2f273a0>

4f)¶

Let's now summarize the linear relationships between numeric columns using a CORRELATION PLOT. You do NOT need to display the correlation matrix first this time. Instead, we will jump straight to visualizing the CORRELATION PLOT.

Create a correlation plot between the numeric columns in df04. The correlation plot must be created using Seaborn. You must use a DIVERGING color palette with the correct bounds and midpoint.

Do you feel this correlation plot needs to be annotated? Try annotating the correlation plot and then NOT annotating it. Are you able to reach the same conclusions without the annotated text?

You must ignore the non-numeric column for this correlation plot.

4f) - SOLUTION¶

In [127]:
### 2.4 Linear relationships between the numeric columns using correlation plot

# annot = True
sns.heatmap(df04.corr(numeric_only=True), 
            vmin=-1, vmax=1, center=0, cmap='coolwarm',
            annot=True,annot_kws={'size':10},
            cbar=False,
            fmt='.2f')
Out[127]:
<AxesSubplot: >
In [128]:
# annot = False
sns.heatmap(df04.corr(numeric_only=True), 
            vmin=-1, vmax=1, center=0, cmap='coolwarm',
            annot=False,annot_kws={'size':20},
            cbar=False,
            fmt='.2f')
Out[128]:
<AxesSubplot: >

I think it is OK not to annotate the correlation plot, since the color and its intensity are clear enough to show whether a correlation is positive or negative and how strong it is.

I can draw the same conclusions without the annotated text.

4g)¶

Let's now group the correlation plot by the non-numeric column.

Create a grouped correlation plot between the numeric columns in df04. You must group by the non-numeric column. The separate categories on the non-numeric column must be associated with separate subplots. The subplot title must be specified correctly to make it clear which subplot is associated with which value of the non-numeric column. You must use a DIVERGING color palette with the correct bounds and midpoint.

Do you feel this correlation plot needs to be annotated? Try annotating the correlation plot and then NOT annotating it. Are you able to reach the same conclusions without the annotated text?

4g) - SOLUTION¶

In [133]:
### 2.5 Linear relationships between the numeric columns using heatmap and grouped by the non-numeric column
the_groups = df04.v.unique().tolist()

corr_per_group = df04.groupby('v').corr(numeric_only=True)

fig,axs = plt.subplots(len(the_groups),1,figsize=(8,25),sharex=True,sharey=True)

for ix in range(len(the_groups)):
    sns.heatmap(corr_per_group.loc[the_groups[ix],:], 
                vmin=-1, vmax=1, center=0, cmap='coolwarm',
                annot=True,annot_kws={'size':10},
                cbar=False,
                fmt='.2f',
                ax=axs[ix])
    axs[ix].set_title('v: %s' % the_groups[ix])
In [134]:
# without annot
the_groups = df04.v.unique().tolist()

corr_per_group = df04.groupby('v').corr(numeric_only=True)

fig,axs = plt.subplots(len(the_groups),1,figsize=(8,25),sharex=True,sharey=True)

for ix in range(len(the_groups)):
    sns.heatmap(corr_per_group.loc[the_groups[ix],:], 
                vmin=-1, vmax=1, center=0, cmap='coolwarm',
                annot=False,annot_kws={'size':10},
                cbar=False,
                fmt='.2f',
                ax=axs[ix])
    axs[ix].set_title('v: %s' % the_groups[ix])

I think it is OK not to annotate the grouped correlation plot, since the color and its intensity are clear enough to show whether a correlation is positive or negative and how strong it is within each group.

I can draw the same conclusions without the annotated text.

4h)¶

What were the pros and cons of exploring the RELATIONSHIPS between numeric columns with a PAIRS PLOTS for this data set?

What were the pros and cons of exploring the LINEAR relationships between the numeric columns with CORRELATION PLOTS for this data set?

4h) - SOLUTION¶

What do you think?

Problem 05¶

Let's now return to explore the continuous variable distributions in depth for df04. You have seen that there are more than just a few continuous variables in this data set! It might seem like we need to perform a lot of tedious actions to explore all of the variables. But, you do NOT need to manually create all figures! You do NOT need to resort to for-loops either! Instead, the data can be RESHAPED from the current WIDE-FORMAT to LONG-FORMAT. This allows associating Seaborn's FACETS with the continuous variables!

5a)¶

First, display the number of rows and columns in df04 as a reminder.

5a) - SOLUTION¶

In [135]:
###
print(df04.shape)
(633, 13)

5b)¶

Reshape the df04 WIDE-FORMAT DataFrame into LONG-FORMAT. The numeric columns of df04 MUST be "gathered up" or STACKED on top of each other. The non-numeric column must NOT be gathered up. You MUST include a column named rowid that corresponds to the row index. The rowid column must NOT be gathered up with the other numeric columns.

Assign the LONG-FORMAT data set to the lf04 object.

Display the .info() method for the LONG-FORMAT object to the screen.

5b) - SOLUTION¶

In [143]:
###
df04_features = df04.select_dtypes('number').copy()
df04_objects = df04.select_dtypes('object').copy()

id_cols = ['rowid'] + df04_objects.columns.tolist()

lf04 = (
    df04.reset_index()
    .rename(columns={'index': 'rowid'})
    .melt(id_vars=id_cols, value_vars=df04_features.columns)
)

lf04.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7596 entries, 0 to 7595
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   rowid     7596 non-null   int64  
 1   v         7596 non-null   object 
 2   variable  7596 non-null   object 
 3   value     7596 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 237.5+ KB
In [144]:
lf04
Out[144]:
      rowid   v variable     value
0         0  A1      x01  1.264427
1         1  A1      x01  1.192453
2         2  A1      x01  0.687623
3         3  A1      x01 -0.440204
4         4  A1      x01 -0.017212
...     ...  ..      ...       ...
7591    628  C3      x12 -0.633067
7592    629  C3      x12 -1.049238
7593    630  C3      x12 -0.912559
7594    631  C3      x12 -0.627877
7595    632  C3      x12 -0.748772

7596 rows × 4 columns

5c)¶

How many rows and columns are in lf04?

Does the number of rows "make sense" given the shape of df04?

5c) - SOLUTION¶

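7596 rows and 4 columns.

Yes, the number of rows makes sense. df04 has 633 rows and 12 numeric columns (13 columns minus the non-numeric column v), and each numeric value becomes its own row in lf04, so 633 * 12 = 7596.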
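
The same check can be written in code. The sketch below does not use the real df04; it builds a synthetic stand-in (`fake04`, an assumption, with made-up category labels) that has the same shape, reshapes it the same way, and compares the row counts.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same shape as df04:
# 633 rows, 12 numeric columns and 1 non-numeric column. Values are random.
rng = np.random.default_rng(0)
num_names = [f'x{i:02d}' for i in range(1, 13)]
fake04 = pd.DataFrame(rng.normal(size=(633, 12)), columns=num_names)
fake04['v'] = rng.choice(['A1', 'B2', 'C3'], size=633)  # made-up labels

fake_lf = (
    fake04.reset_index()
          .rename(columns={'index': 'rowid'})
          .melt(id_vars=['rowid', 'v'], value_vars=num_names)
)

# Expected rows in long format = rows in wide format * number of gathered columns.
n_rows, n_numeric = fake04.shape[0], len(num_names)
print(fake_lf.shape[0], n_rows * n_numeric)

# Every (rowid, variable) pair should appear exactly once.
pairs_unique = not fake_lf.duplicated(subset=['rowid', 'variable']).any()
print(pairs_unique)
```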

5d)¶

You can now use the LONG-FORMAT data to visually explore the numeric columns in df04!

Visualize the marginal distributions for each numeric variable in df04 using the LONG-FORMAT lf04 object and Seaborn. You must associate the correct newly created "gathered" value column with the x axis argument. You must associate the column facets with the correct newly created "gathered" variable column. You must use 21 bins to create the histograms. The figure should have 4 facets per row. The x and y scales of the facets must be free or not-shared across the facets.

How would you describe the SHAPES of the continuous variable distributions?

HINT: Which Seaborn argument allows you to ASSOCIATE or link the COLUMN FACET to a column in the data?

5d) - SOLUTION¶

In [154]:
###

sns.displot(data=lf04, x='value', col='variable', kind='hist', 
            facet_kws={'sharex':False,'sharey':False},
            col_wrap=4,height=2,aspect=1,
            bins = 21)
Out[154]:
<seaborn.axisgrid.FacetGrid at 0x24af20cf700>

Most of the distributions look roughly normal (bell-shaped), but some look like mixtures of several normal distributions, with more than one peak.

5e)¶

The lf04 LONG-FORMAT DataFrame has a separate column for the non-numeric column in df04. Thus, it was NOT "gathered" with the numeric columns. You can therefore use the non-numeric column as a GROUPING variable in the visualizations!

Visualize the CONDITIONAL KDE plots for each numeric variable in df04 within FACETS of a single figure. Each facet must be associated with one of the "original" numeric columns in df04. You must associate the correct newly created "gathered" value column in the x axis argument. You must associate the column facets with the correct newly created "gathered" variable column. You must associate the "original" df04 non-numeric column with the CONDITIONAL KDE color. The figure should have 4 facets per row. The x and y scales of the facets must be free or not-shared across the facets.

Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT across the categories of the non-numeric column?

HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?

HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?

5e) - SOLUTION¶

In [155]:
###

sns.displot(data=lf04, x='value', col='variable', kind='kde',hue='v',
            col_wrap=4,height=2,aspect=1,
            facet_kws={'sharex':False,'sharey':False},
            common_norm=False)
Out[155]:
<seaborn.axisgrid.FacetGrid at 0x24a91013310>

For the variables x03 and x09, the conditional distributions differ across the categories of the non-numeric column.
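
The second hint concerns the sample size effect. With Seaborn's default `common_norm=True`, each group's density is scaled by its share of the rows, so a small group gets a short curve even when its shape matches a large group. The sketch below illustrates the idea with plain NumPy histograms on made-up data rather than with Seaborn itself; the group sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two made-up groups drawn from the same distribution but with very different sizes.
big = rng.normal(size=900)
small = rng.normal(size=100)
edges = np.linspace(-5, 5, 41)
width = np.diff(edges)

# Independent normalization (the idea behind common_norm=False):
# each group's histogram integrates to 1 on its own.
dens_big, _ = np.histogram(big, bins=edges, density=True)
dens_small, _ = np.histogram(small, bins=edges, density=True)
area_big = np.sum(dens_big * width)
area_small = np.sum(dens_small * width)

# Common normalization (the idea behind common_norm=True):
# each group is weighted by its share of all rows, so the areas sum to 1 together.
n_total = big.size + small.size
common_big = area_big * big.size / n_total
common_small = area_small * small.size / n_total
print(area_big, area_small, common_big, common_small)
```

With independent normalization, differences between the curves reflect differences in shape and location only, which is what the question about conditional distributions asks for.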

5f)¶

Although there are multiple conditional distribution plots we should use to fully explore the data, you will conclude this assignment with a BOXPLOT. You will create separate BOXPLOTS for each "original" numeric column within FACETS of a single figure. Each facet must be associated with one of the "original" numeric columns in df04. You must associate the "original" df04 non-numeric column with the x axis argument. You must associate the correct newly created "gathered" value column with the y axis argument. You must associate the column facets with the correct newly created "gathered" variable column.

Experiment with using shared x and y axis scales across the FACETS and NOT SHARING the x and y axis scales. Which approach seems best for this particular data set?

5f) - SOLUTION¶

In [158]:
###

sns.catplot(data=lf04, x='v', y='value', col='variable',kind='box',
            sharex=False,sharey=False,
            col_wrap=4,height=2,aspect=1)        
Out[158]:
<seaborn.axisgrid.FacetGrid at 0x24ace65a850>
In [160]:
sns.catplot(data=lf04, x='v', y='value', col='variable',kind='box',
            sharex=True,sharey=True,
            col_wrap=4,height=2,aspect=1)
Out[160]:
<seaborn.axisgrid.FacetGrid at 0x24acf7ba070>

NOT sharing the x and y axis scales across the facets works better for this data set.
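
One way to back up a choice like this with numbers is to compare the range of each gathered variable. The sketch below runs on made-up long-format data (`toy_lf` is an assumption, not lf04): if the spans differ a lot, shared y axes would compress the facets of the small-scale variables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Made-up long-format data: two variables on very different scales.
toy_lf = pd.DataFrame({
    'variable': ['x01'] * 50 + ['x02'] * 50,
    'value': np.concatenate([rng.normal(0, 1, 50), rng.normal(100, 20, 50)]),
})

# Range of each gathered variable; widely different spans argue for free y axes.
ranges = toy_lf.groupby('variable')['value'].agg(['min', 'max'])
ranges['span'] = ranges['max'] - ranges['min']
print(ranges)
```

The same `groupby` call applied to lf04 would show whether the real variables are on comparable scales.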

Problem 06¶

You must download the data associated with one of the Final Projects from the Canvas site. Save the file(s) in the same directory as this Jupyter notebook. You may use the same project as the previous assignment OR switch to a different project.

Read in the data associated with one of the Final Projects. You previously visually explored MARGINAL behavior. You must now begin to visually explore relationships between variables in the Project data. However, you do NOT need to explore ALL relationships in this assignment.

You MUST create at least 6 plots which explore relationships between variables. Those plots can be categorical-to-categorical relationships (combinations), categorical-to-continuous relationships, and/or continuous-to-continuous relationships. The exact types of plots you should use depend on the project.

However, 2 of the plots MUST involve MORE than 2 variables.

Add as many cells as you feel are necessary.

06) - SOLUTION¶

In [213]:
df_input = pd.read_csv('trial_inputs.csv')
df_output = pd.read_csv('trial_outputs.csv')

df_output_max_cycle = df_output.groupby('trial_id').last().reset_index()
In [217]:
# concat input and output
df_trial = pd.concat([df_input,df_output_max_cycle[['cycle','y']]],axis=1)
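
The `pd.concat(..., axis=1)` call above pairs rows by position, so it relies on both frames listing the trials in the same order. A more defensive alternative, sketched below on made-up stand-in frames (`toy_input` and `toy_output` are assumptions, not the project files), is to join on the `trial_id` key instead.

```python
import pandas as pd

# Made-up stand-ins for the two project files; column names mirror the ones used above.
toy_input = pd.DataFrame({'trial_id': [1, 2, 3], 'x1': [-1, 0, 1]})
toy_output = pd.DataFrame({
    'trial_id': [3, 3, 1, 1, 2],      # deliberately not sorted by trial
    'cycle':    [1, 2, 1, 2, 1],
    'y':        [0.30, 0.31, 0.10, 0.11, 0.20],
})

# Last recorded cycle per trial: sort first so "last" really is the largest cycle.
toy_last = (
    toy_output.sort_values(['trial_id', 'cycle'])
              .groupby('trial_id', as_index=False)
              .last()
)

# Join on the key rather than on row position; validate guards against duplicate keys.
toy_trial = toy_input.merge(toy_last, on='trial_id', how='left', validate='one_to_one')
print(toy_trial)
```

If the two files are already sorted identically, both approaches give the same table; the merge simply removes the dependence on row order.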
In [218]:
### 1.1 Basic info on number of rows and columns, names of columns and data types
print(df_trial.shape)
df_trial.info()
(240, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   trial_id  240 non-null    int64  
 1   x1        240 non-null    int64  
 2   x2        240 non-null    int64  
 3   x3        240 non-null    int64  
 4   x4        240 non-null    int64  
 5   x5        240 non-null    int64  
 6   x6        240 non-null    object 
 7   cycle     240 non-null    int64  
 8   y         240 non-null    float64
dtypes: float64(1), int64(7), object(1)
memory usage: 17.0+ KB
In [219]:
### 1.2 Check for missing values
df_trial.isnull().sum()
Out[219]:
trial_id    0
x1          0
x2          0
x3          0
x4          0
x5          0
x6          0
cycle       0
y           0
dtype: int64
In [220]:
### 1.3 Check for unique values
df_trial.nunique()
Out[220]:
trial_id    240
x1            3
x2            3
x3            2
x4            2
x5            3
x6            4
cycle        48
y           240
dtype: int64
In [221]:
### 1.4 Describe all columns
df_trial.describe(include='all')
Out[221]:
trial_id x1 x2 x3 x4 x5 x6 cycle y
count 240.00000 240.000000 240.000000 240.00000 240.00000 240.000000 240 240.000000 240.000000
unique NaN NaN NaN NaN NaN NaN 4 NaN NaN
top NaN NaN NaN NaN NaN NaN A NaN NaN
freq NaN NaN NaN NaN NaN NaN 60 NaN NaN
mean 120.50000 0.000000 0.000000 0.00000 0.00000 0.000000 NaN 72.629167 0.000267
std 69.42622 0.731823 0.731823 1.00209 1.00209 0.731823 NaN 30.387594 0.000040
min 1.00000 -1.000000 -1.000000 -1.00000 -1.00000 -1.000000 NaN 7.000000 0.000169
25% 60.75000 -1.000000 -1.000000 -1.00000 -1.00000 -1.000000 NaN 47.000000 0.000236
50% 120.50000 0.000000 0.000000 0.00000 0.00000 0.000000 NaN 96.000000 0.000271
75% 180.25000 1.000000 1.000000 1.00000 1.00000 1.000000 NaN 100.000000 0.000282
max 240.00000 1.000000 1.000000 1.00000 1.00000 1.000000 NaN 100.000000 0.000355
In [273]:
### 1.5 Reshape from wide to long: gather the inputs x1 to x5, keep x6 as an identifier
df_trial_lf = (
    df_trial.reset_index()
            .rename(columns={'index': 'id'})
            .melt(id_vars=['id', 'trial_id', 'cycle', 'y', 'x6'],
                  value_vars=['x1', 'x2', 'x3', 'x4', 'x5'])
)
In [274]:
df_trial_lf
Out[274]:
id trial_id cycle y x6 variable value
0 0 1 39 0.000263 A x1 -1
1 1 2 52 0.000279 A x1 1
2 2 3 38 0.000266 A x1 -1
3 3 4 50 0.000287 A x1 1
4 4 5 40 0.000268 A x1 -1
... ... ... ... ... ... ... ...
1195 235 236 100 0.000301 D x5 1
1196 236 237 83 0.000262 D x5 1
1197 237 238 100 0.000345 D x5 0
1198 238 239 100 0.000345 D x5 0
1199 239 240 81 0.000265 D x5 0

1200 rows × 7 columns

In [275]:
### 2.1 Histograms of the gathered input levels (x1 to x5), one facet per variable
sns.displot(data=df_trial_lf, x='value', col='variable', kind='hist', 
            facet_kws={'sharex':False,'sharey':False},
            col_wrap=3,height=3,aspect=1,
            bins = 6)
Out[275]:
<seaborn.axisgrid.FacetGrid at 0x24b906850a0>
In [276]:
### 2.2 Histogram for numeric column marginal distribution
sns.displot(data=df_trial, x='y', kind='hist', bins=20, height=4, aspect=2)
Out[276]:
<seaborn.axisgrid.FacetGrid at 0x24a9df99820>
In [282]:
### 2.3 Boxplot to visualize the summary statistics of the numeric column GIVEN the non-numeric column
sns.catplot(data=df_trial_lf, x = 'x6', y='cycle', kind='box',
            col='variable',col_wrap=3,height=3,aspect=1,
            sharex=False,sharey=False,
            showmeans=True,
            meanprops={'marker':'o','markerfacecolor':'white','markeredgecolor':'black'})
Out[282]:
<seaborn.axisgrid.FacetGrid at 0x24b94385070>
In [280]:
### 2.4 Point plot to compare the conditional means of the numeric column GIVEN the non-numeric column
sns.catplot(data=df_trial_lf, x='x6', y='cycle', kind='point', linestyle='none',
            col='variable', col_wrap=3, height=3, aspect=1)
Out[280]:
<seaborn.axisgrid.FacetGrid at 0x24b94022130>
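
A caveat about faceting cycle against x6 by `variable`: the long-format frame repeats every trial once per gathered input, so each facet receives the same x6 and cycle pairs. The sketch below checks this on made-up data (`toy` is an assumption with the same column roles as df_trial) and shows one possible remedy, grouping by the gathered level as well.

```python
import pandas as pd

# Made-up wide data with the same column roles as df_trial.
toy = pd.DataFrame({
    'x1': [-1, 1, -1, 1],
    'x2': [0, 0, 1, 1],
    'x6': ['A', 'A', 'B', 'B'],
    'cycle': [40, 50, 80, 100],
})
toy_lf = toy.melt(id_vars=['x6', 'cycle'], value_vars=['x1', 'x2'])

# Mean cycle per x6 category within each facet variable: the columns are identical.
by_facet = toy_lf.groupby(['x6', 'variable'])['cycle'].mean().unstack('variable')
same_in_every_facet = bool((by_facet['x1'] == by_facet['x2']).all())

# Grouping by the gathered level as well gives each facet its own content.
by_level = toy_lf.groupby(['variable', 'value', 'x6'])['cycle'].mean()
print(by_facet)
print(same_in_every_facet)
print(by_level)
```

In Seaborn terms, the analogous change would be adding `hue='value'` to the `catplot` calls, which also makes each plot involve more than 2 variables.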
In [291]:
### 2.5 Pairplot for numeric columns in df_trial
sns.pairplot(data=df_trial, vars=['trial_id','y','cycle'],
             diag_kws={'common_norm':False})
Out[291]:
<seaborn.axisgrid.PairGrid at 0x24afa93d1c0>
In [290]:
### 2.6 Heatmap for the correlation matrix of the numeric columns

sns.heatmap(df_trial[['trial_id','cycle','y']].corr(numeric_only=True), 
            vmin=-1, vmax=1, center=0, cmap='coolwarm',
            annot=True,annot_kws={'size':10},
            cbar=False,
            fmt='.2f')
Out[290]:
<AxesSubplot: >